feat(manifest): add ManifestFilterManager and ManifestMergeManager#652
feat(manifest): add ManifestFilterManager and ManifestMergeManager#652gty404 wants to merge 1 commit into
Conversation
Implement two manifest management classes for table write operations: - ManifestFilterManager: filters manifest entries by row filter expression, file path, or partition value; supports FailMissingDeletePaths validation. Rewrites manifests that contain matching files, marking entries as DELETED; passes through manifests that cannot contain matching files unchanged. - ManifestMergeManager: merges small manifests using greedy bin-packing, grouping by partition_spec_id (manifests with different specs are never merged). Oversized manifests pass through unchanged. ADDED entries from prior manifests become EXISTING when merged (matching Java semantics).
wgtmac
left a comment
There was a problem hiding this comment.
Java parity review: the overall structure looks reasonable, but a few details are not yet aligned with Java ManifestFilterManager/ManifestMergeManager semantics.
| for (const auto& de : delete_exprs_) { | ||
| ICEBERG_ASSIGN_OR_RAISE(auto* eval, GetMetricsEvaluator(metadata, spec_id, de)); | ||
| ICEBERG_ASSIGN_OR_RAISE(auto matches, eval->Evaluate(file)); | ||
| if (matches) return true; |
There was a problem hiding this comment.
This row-filter delete path is broader than Java's behavior. Java's ManifestFilterManager first computes a partition residual and then uses both inclusive and strict metrics: files are deleted only when rowsMustMatch is true, and data files where only some rows may match raise Cannot delete file where some, but not all, rows match filter. Here, an inclusive might match result is enough to mark the whole file as DELETED, so a file whose bounds straddle the predicate would be incorrectly removed instead of rejected.
| const TableMetadata& metadata, const std::shared_ptr<Snapshot>& base_snapshot, | ||
| const ManifestWriterFactory& writer_factory) { | ||
| // No base snapshot → nothing to filter | ||
| if (!base_snapshot) return std::vector<ManifestFile>{}; |
There was a problem hiding this comment.
Java still validates required deletes when the manifest list is null or empty: filterManifests calls validateRequiredDeletes() before returning. With FailMissingDeletePaths() set, this early return silently succeeds for a missing path on an empty/no-base snapshot, while Java reports Missing required files to delete.
| bool ManifestFilterManager::CanContainDeletedFiles(const ManifestFile& manifest, | ||
| const TableMetadata& metadata) { | ||
| // A manifest with no live files cannot contain files to delete. | ||
| bool has_live = (manifest.added_files_count.value_or(0) > 0) || |
There was a problem hiding this comment.
This treats missing manifest counts as zero live files. In Iceberg/Java, null addedFilesCount or existingFilesCount means the count is unknown and ManifestFile.hasAddedFiles/hasExistingFiles returns true. Skipping such manifests can miss deletes for older or compatible metadata where counts are not populated.
| all.insert(all.end(), new_manifests.begin(), new_manifests.end()); | ||
| all.insert(all.end(), existing_manifests.begin(), existing_manifests.end()); | ||
|
|
||
| if (!merge_enabled_ || std::cmp_less(all.size(), min_count_to_merge_)) { |
There was a problem hiding this comment.
This global min_count_to_merge_ check does not match Java. Java only applies minCountToMerge to the bin containing the first/newest manifest; older bins that do not contain first can still be merged even when the total input count is below the threshold. Returning all manifests here can skip merges Java would perform.
| std::vector<ManifestFile> result; | ||
|
|
||
| for (const auto& manifest : group) { | ||
| if (manifest.manifest_length >= target_size_bytes_) { |
There was a problem hiding this comment.
Java uses ListPacker.packEnd(group, ManifestFile::length) with lookback 1 specifically to preserve manifest order and leave the under-filled bin at the beginning. This forward hand-rolled packing keeps the current small bin open across an oversized manifest, so a sequence like [small, oversized, small] can merge the two small manifests across the oversized one and reorder output relative to Java.
| const ManifestFile& first = all[0]; | ||
|
|
||
| // Group manifests by partition_spec_id — never merge across specs | ||
| std::map<int32_t, std::vector<ManifestFile>> by_spec; |
There was a problem hiding this comment.
Java groups manifests by partition spec using a reverse-order TreeMap. This std::map iterates specs in ascending order, so the merged manifest list order can differ from Java. For v3 tables this ordering is observable because manifest-list writing assigns first-row IDs in output order.
Implement two manifest management classes for table write operations:
ManifestFilterManager: filters manifest entries by row filter expression, file path, or partition value; supports FailMissingDeletePaths validation. Rewrites manifests that contain matching files, marking entries as DELETED; passes through manifests that cannot contain matching files unchanged.
ManifestMergeManager: merges small manifests using greedy bin-packing, grouping by partition_spec_id (manifests with different specs are never merged). Oversized manifests pass through unchanged. ADDED entries from prior manifests become EXISTING when merged (matching Java semantics).